Skip to content

Conversation

@Matt711
Copy link
Contributor

@Matt711 Matt711 commented May 13, 2025

Description

Contributes to #16483 by adding fast paths to DataFrame.to_cupy (which is called when DataFrame.values is called). The PR follows up #18450 to add cython bindings for cudf::table_to_array to pylibcudf and plumbs those changes through cudf classic.

I benchmarked the fast (True) and slow (False) when the dataframe has 1, 6, 20, and 100 columns. The fast paths use cudf::table_to_array if the number of columns is greater than 1 and cp.asarray directly if the dataframe has only one column. The slow path uses a raw python loop + assignment to create the cupy array.
image

I used the median because the CUDA overhead of calling cudf::table_to_array is large (so there are outliers in the times). Here is a profile of calling to_cupy twice for both the slow and fast paths.
Screenshot from 2025-05-13 12-23-46

In the first calls, the fast path takes 7.3 ms vs 4.8 ms for the slow path. The first call to cudf::table_to_array is the bottleneck. But if you compare the second calls, the fast path is much faster (79 us vs 2.3ms)

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@Matt711 Matt711 requested a review from a team as a code owner May 13, 2025 18:45
@Matt711 Matt711 added feature request New feature or request non-breaking Non-breaking change labels May 13, 2025
@github-actions github-actions bot added Python Affects Python cuDF API. pylibcudf Issues specific to the pylibcudf package labels May 13, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python May 13, 2025
@bdice
Copy link
Contributor

bdice commented May 13, 2025

@Matt711 Can do you some benchmarks on large-ish square tables? Like 1000x1000 or 10,000x10,000.

@Matt711
Copy link
Contributor Author

Matt711 commented May 14, 2025

@Matt711 Can do you some benchmarks on large-ish square tables? Like 1000x1000 or 10,000x10,000.

image

@Matt711 Matt711 requested a review from bdice May 14, 2025 19:38
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple small fixes, then LGTM.

@Matt711 Matt711 added the 5 - Ready to Merge Testing and reviews complete, ready to merge label May 14, 2025
@Matt711
Copy link
Contributor Author

Matt711 commented May 15, 2025

/merge

return _index_from_data(data, self.name)

@_performance_tracking
def to_pylibcudf(self, copy=False) -> tuple[plc.Column, dict]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs failed saying Index.to_pylibcudf was missing. I went ahead and added both to_plc and from_plc in this PR. They are almost identical to Series.from/to_plc. In a follow up PR, I can centralize these methods since both Index and Series are SingleColumnFrames

@rapids-bot rapids-bot bot merged commit 7694248 into rapidsai:branch-25.06 May 15, 2025
123 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF Python May 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

3 participants